Support for NVML counters: GPU energy, instant GPU power, avg memory power#525
Conversation
|
@mpatrou Thank you for posting this PR, I will take a look through it. |
| nvmlReturn_t DECLDIR nvmlDeviceGetMemoryInfo(nvmlDevice_t, nvmlMemory_t *); | ||
| nvmlReturn_t DECLDIR nvmlDeviceGetPerformanceState(nvmlDevice_t, nvmlPstates_t *); | ||
| nvmlReturn_t DECLDIR nvmlDeviceGetPowerUsage(nvmlDevice_t, unsigned int *); | ||
| nvmlReturn_t DECLDIR nvmlDeviceGetTotalEnergyConsumption(nvmlDevice_t, unsigned long long *); |
There was a problem hiding this comment.
As a note from nvml.h, this function is supported from Volta and up.
There was a problem hiding this comment.
However, I tested on Voltar at Oregon with a P100 and it didn't seem to have an issue.
src/high-level/papi_hl.c
Outdated
| verbose_fprintf(stdout, "PAPI-HL Info: The event \"%s\" will be stored as instantaneous value.\n", requested_event_names[i]); | ||
| } | ||
|
|
||
| // except from nvml energy_consumption delta |
There was a problem hiding this comment.
Q: An above comment mentions that all nvml events will be instantaneous values. Why for energy_consumption and gpu_inst_power would this need to be changed to delta and the new event_type average respectively?
There was a problem hiding this comment.
As when I use your updated papi_hl.c the value of energy_consumption is zero. Which for the master branch does not occur and I get a non-zero value.
There was a problem hiding this comment.
Regarding the first question.
- Energy consumption is a cumulative counter. When we use papi_hl API to regions of code, we would need to measure the counter before and after the code and report the difference as the energy consumed to execute that region of code; thus the delta calculation.
- The gpu_inst_power coule be benefit from an average calculation for a region, instead of reporting one value at the end of the region to give a more detailed information.
P.S. Thank you so much for your very detailed review! I am going though your comments and making changes locally!!
There was a problem hiding this comment.
For the second comment: Did you try to profile some code that runs over 1ms or so? to give enough time to get 2 different energy values before and after the code execution.
There was a problem hiding this comment.
For the second comment: Did you try to profile some code that runs over 1ms or so? to give enough time to get 2 different energy values before and after the code execution.
I added a sleep of 10 seconds and did end up getting output for it. Thank you for pointing that out.
There was a problem hiding this comment.
Regarding the first question.
Energy consumption is a cumulative counter. When we use papi_hl API to regions of code, we would need to measure the counter before and after the code and report the difference as the energy consumed to execute that region of code; thus the delta calculation.
The gpu_inst_power coule be benefit from an average calculation for a region, instead of reporting one value at the end of the region to give a more detailed information.
I understand, a current workaround that I thought of was the following workflow:
PAPI_hl_region_begin()
PAPI_hl_read() // Get your initial energy
// Kernel launch/Work
PAPI_hl_region_end()The issue I see with this option, is you would then have to manually or programmatically grab the appropriate values and then take the difference.
Could you split this PR into two (I also can if you are currently busy with other work just let me know)? This PR can contain the updates to the NVML component and then the new PR would be updates just for papi_hl.c.
There was a problem hiding this comment.
The approach we have on the PR, helps to profile the energy information of applications with kokkos and kokkos-tools papi connector: https://github.com/kokkos/kokkos-tools/tree/develop/profiling/papi-connector, with no change on the kokkos side. (as a side-note, for the reason we took that route)
Sure, no problem. I can split the PR . I'll move the updates for the papi_hl.c to another one.
src/high-level/papi_hl.c
Outdated
| /* get cycles for last component */ | ||
| retval = PAPI_read_ts( _local_components[i].EventSet, _local_components[i].values, &_local_cycles ); | ||
| } | ||
| retval = PAPI_read( _local_components[i].EventSet, _local_components[i].values); |
There was a problem hiding this comment.
Q: In the PR description you mentioned that "internal_hl_read_counters that wasn't reading the counters for all components in the machines we tested". What exactly was your setup for this i.e. PAPI configure, system you were on, exported PAPI_EVENTS, etc.
I used the master papi_hl.c with this branch for the three new nvml events you added and all three show with values in the .json I end up creating.
Treece-Burgess
left a comment
There was a problem hiding this comment.
Final testing was done on Methane at ICL (1 * A100) and Athena at Oregon (4 * A100).
In both testing cases ./configure --prefix=$PWD/test-install --with-components="nvml" --with-debug=yes was utilized.
Methane at ICL (1 * A100)
With Cuda Toolkit 12.9:
papi_component_avail- ✅ (count was updated from 27 in the master branch to 30 in this branch)papi_native_avail- ✅ (showed the three new nvml native events added in this branch)papi_command_line- ✅ (all three of the new nvml native events were able to be added and collect counter values for)HelloWorld.cu- ✅ (for all three of the new nvml native events the test ran successfully)
Athena at Oregon (4 * A100)
Testing on this machine was only for compilation purposes to make sure that the #if defined's worked properly. To do this I used Cuda Toolkit 12.0 as NVML_POWER_SCOPE_GPU and NVML_POWER_SCOPE_MEMORY do not exist. Results are:
- PAPI build: ✅
papi_native_avail- ✅ (it did not show the events forgpu_memory_avg_powerandgpu_inst_poweras expected)papi_command_line- ✅ (ran successfully withtotal_energy_consumption)
…ded, papi_hl delta and average calculations added, issue with components fixed
5ca3633 to
4b26585
Compare
Pull Request Description
Sibling PR (1/2) split into #529
The code
Author Checklist
Why this PR exists. Reference all relevant information, including background, issues, test failures, etc
Commits are self contained and only do one thing
Commits have a header of the form:
module: short descriptionCommits have a body (whenever relevant) containing a detailed description of the addressed problem and its solution
The PR needs to pass all the tests